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Abstract. At the heart of many scientific applications is the solution of algebraic systems, 
such as linear systems of equations, eigenvalue problems, and optimization problems, to name 
a few. TOPS, which stands for Towards Optimal Petascale Simulations, is a SciDAC applied 
math center focused on the development of solvers for tackling these algebraic systems, as well 
as the deployment of such technologies in large-scale scientific applications of interest to the 
U.S. Department of Energy. In this paper, we highlight some of the solver technologies we have 
developed in optimization and matrix computations. We also describe some accomplishments 
achieved using these technologies in UNEDF, a SciDAC application project on nuclear physics. 



1. Introduction 

Over the last couple of decades, simulation science has become as important as theoretical 
and experimental science. The success of simulation science hinges on the ability to perform 
the calculations efficiently. The inner most kernel in these calculations is often the solution 
of algebraic systems, including, but not limited to, systems of linear and nonlinear equations, 
eigenvalue problems, optimization problems, and sensitivity analysis. TOPS, which stands for 
Towards Optimal Petascale Simulations, is a multi- institutional SciDAC applied math center 
that focuses on the development of solvers for tackling these algebraic systems, as well as the 
deployment of such technologies in large-scale scientific applications, particularly those of interest 
to the U.S. Department of Energy. 

In this paper, we highlight two specific areas of TOPS: eigenvalue calculations and 
optimization. In particular, we highlight some accomplishments we have achieved in 
collaboration with computational physicists in UNEDF. The goal of the UNEDF SciDAC 
application project [1] is to obtain a comprehensive understanding of nuclei and their reactions 
based on the most accurate knowledge of the strong nuclear interaction. Eigenvalue calculations 
come up in the solution of the nuclear Schrodinger equation [21 [3]. The eigenvalues and the 
eigenvectors correspond to the energy states and wave functions. Numerical optimization 
techniques are needed in building the next generation of nuclear energy functionals, which will 



provide nuclear physicists better tools for predicting the properties and behavior of atomic nuclei 
over the entire nuclear table. 

2. Eigenvalue Calculations 

In nuclear configuration interaction calculation, it is sometimes necessary to investigate, among 
others, nuclear level densities as a function of the total angular momentum J and excitation 
energy, and to evaluate scattering amplitudes as a function of J [4j. We will refer to this as a 
total-J calculation in this paper. In this type of calculation, we are interested in computing a 
relatively large number of states with a prescribed J value. 

One brute-force approach to a total-J calculation is to simply compute a large number of 
eigenvalues and wave functions of a nuclear many-body Hamiltonian, for example in an M- 
scheme basis (good angular momentum projection along the z-axis), and select, among these 
wave functions, the ones that have a prescribed J value. This approach is appropriate when 
the number of desired energy states and wave functions is small (e.g., ten to twenty states). 
When that is not true, or when certain properties of a nucleus pertaining to a fixed J are 
to be calculated, the brute-force approach may require computing a very large number of 
wave functions, and the computational cost for performing this type of calculation may be 
prohibitively high. Furthermore, even if we can afford to perform this type of calculation, this 
may not be an efficient use of resources because we compute a large number of wave functions 
only to throw away most of them because they do not have the desired J value. 

We have developed an alternative approach where we construct an invariant subspace Z 
that contains all wave functions associated with a fixed J value in advance and project the 
nuclear many-body Hamiltonian into this subspace to produce a projected Hamiltonian with the 
minimum dimension consistent with that chosen J. A sparse matrix diagonalization procedure 
[5j [H [7] is then applied to this projected Hamiltonian to obtain the desired energy states and 
their corresponding wave functions. 

To construct Z, we need to work with the total angular momentum square operator J 2 and 
compute the null space of J 2 — XI, where A = J(J + 1) is a known eigenvalue of J 2 . 

When the many-body basis states associated with the configuration space are properly 
ordered and grouped, J 2 becomes block diagonal: J 2 = diag( J 2 , J 2 , J 2 g ). Therefore, the 

task of computing the desired null space of J 2 — XI reduces to that of computing the desired 
null spaces of J 2 — XI, for i = 1,2, ...,n g . 

However, because the dimensions of the J 2 's vary over a wide range (e.g., from 1 to more 
than 36,000 for 12 C, N max = 6), it is difficult to maintain a good load balance in the null space 
calculation. Here, N ma x is a parameter limiting the total number of oscillator quanta allowed 
in the many-body states. 

We developed a multi-level task and data distribution scheme to achieve optimal parallel 
performance in the null space calculation by 

(i) Limiting the granularity of the parallelism; that is, we try to divide the overall task into many small 
tasks of limited sizes so that good load balance arises from distributing these small tasks evenly 
among different processors. 

(ii) Limiting the communication overhead incurred in the null space calculation so that the overall time 
of the computation can be minimized. 

To achieve these inherently conflicting goals, we classified Jf blocks into small, medium 
and large groups based on the estimated computational loads associated with computing the 
desired null space of J 2 — XI, and the estimated ratio of communication volume to floating point 
operations count. 

The small J 2 blocks are distributed among all processors based on their computational load 
by a greedy algorithm. The null spaces of these matrices are computed by a sequential LAPACK 



rank-revealing QR subroutine. No communication is involved in these calculations. Each one 
of the medium-sized Jf blocks is assigned to a subgroup of processors by the same greedy 
algorithm. The null space calculation for such a block is parallelized among processors within 
the same subgroup, which will incur some communication overhead. Finally, the desired null 
space calculation for a large Jf block is carried out in parallel on all processors. 

We implemented three different algorithms for computing the null space of Jf — XI for medium 
and large blocks. 

(i) Randomized rank- revealing QR (RQR). The algorithm performs two standard QR factorizations of 
dense matrices without pivoting. Although we do not take advantage of the sparsity of Jf in this 
approach, it is more efficient than other approaches when the dimension of the desired null space is 
relatively large (e.g. larger than 10% of the dimension of Jf). 

(ii) Shift-invert Lanczos (SIL), which requires solutions of sparse linear systems. 

(iii) Polynomial accelerated subspace iteration (PASI) . We apply a standard subspace iteration jS] to the 
matrix p(Jf), where p(u>) is a polynomial that assumes the value of 1 at ui = A, and has a much 
smaller magnitude (than 1) in other parts of the spectrum of Jf. 

Table [T] shows that our load balance scheme is much better than a brute- force approach of 
distributing Jf in a cyclic fashion to different processors. Table [i] shows that PASI is more 
efficient when J=0. The randomized QR algorithm appears to be more efficient for larger J 
values. However, when J becomes very large, which typically results in smaller dimension of the 
null space, PASI becomes more efficient again. 
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Table 2. RQR decomposition vs PASI for 
different J values. Both methods use the 
greedy load balancing technique. Times are 
in seconds. 
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3. Optimization 

Optimization plays a central role in the building of next-generation nuclear energy functionals, 
including functionals based on density functional theory (DFT) and/or ab initio calculations. 
In the case of DFT-based functionals, for example, a primary computational bottleneck is 
determining parameter values so that the functional agrees with data on a set of observables 
such as binding energies, radii, and odd-even staggering [9j. Mathematically, we need to solve 
the optimization problem 

mmi/(x) = ^( — — 3 9% > x j :lj < Xj <uj, j = l,...,n\ , (1) 




Figure 1. Exploiting structure in param- 
eter estimation problems substantially re- 
duces the number of required simulation 
evaluations. 



Figure 2. Quantifying the computational 
noise for a deterministic simulator across 
2,049 nuclei. 



where n parameter values must be determined from a set of data of o observables. Challenges 
in solving this problem include the computational expense of, and the noise resulting from, the 
iterative calculations performed when simulating the theoretical observable s(0j; x), and the fact 
that derivatives of some simulated observables with respect to certain parameters Xj may not 
be available (or even exist) for use by an optimization algorithm. 

As part of SciDAC efforts, we have developed POUNDERS (Practical Optimization Using 
No DERivatives for Sums of squares), an algorithm for derivative- free optimization of nonlinear 
least-squares problems such as 0. A key benefit of POUNDERS is that it works with the 
individual residuals (di — s(9f,x))/ai rather than the aggregated fit function f(x). As a result, 
POUNDERS can take advantage of the availability of the derivatives of some observables (e.g., 
binding energies) and can approximate nonlinearities in / using simulations at fewer x values. As 
part of the TOPS collaboration, POUNDERS is now available through the open-source Toolkit 
for Advanced Optimization (TAO) |1U| . 

Figure [T] quantifies the computational savings in this ability to exploit the sums of squares 
structure in for a fit to 2,049 binding energies. By working with the residuals, the 
POUNDERS variants obtain far better fits in far fewer evaluations than the analogous variants 
of POUNDER, a similar algorithm that does not have access to the residuals. The warm variants 
illustrate the benefit of using external simulations, done as part of an initial experimental design, 
to warm start the optimization. 

The savings in Figure [T] can be substantial. For the more complex functional optimized in 
each evaluation of / requires 14.4 CPU hours. The resulting parameterization is then 
used to perform a simulation of nuclei across the nuclear table in a calculation requiring 9,000 
processors for more than half a day [9] . 

Mathematical work has also contributed to the sensitivity analysis of nuclear energy 
functionals. Though the simulations are typically deterministic, the aforementioned 
computational noise can obfuscate the number of reliable digits in computed functional 
properties. The ECNoise algorithm described in [12] estimates a standard deviation- like quantity 
using only a few simulations. Figure [2] illustrates the relative noise in the computed binding 
energies for 2,049 nuclei with the parameterization obtained from the POUNDERS optimization 
in Figure [T] Estimates of the noise can, for example, reveal limitations on the predictability of 
computed functional observables and can enable stable approximations of the noisy derivatives 
needed for sensitivity analysis. 



4. Conclusions 

Our work on eigenvalue calculations has made several impacts on nuclear structure calculations. 
Earlier collaborations with nuclear physicists led to significant improvements to an eigensolver for 
configuration interaction calculation, which was subsequently used in predicting the properties 
of 14 F before the isotope was observed experimentally. The work described in this paper takes 
configuration interaction calculation one step further. It enables our physics collaborators to 
efficiently compute energy states of a nuclei with a prescribed total angular momentum instead 
of computing many energy states and identifying those corresponding to the prescribed total 
angular momentum. 

The optimal parameters we have delivered to our physics collaborators have resulted in 
realistic functionals that are now being explored by a variety of groups outside of the UNEDF 
collaboration (as evidenced in the most recent JUSTIPEN conference ( |http : //massexplorer .~| 
|org/justipen/ index ,php[ )). For example, our current results show remarkable power for 
predicting fission barrier heights, which is a first step toward a microscopic understanding of 
fission. These results are a consequence of including a richer set of experimental data and more 
free parameters, resulting in problems that can be solved only by an efficient, state-of-the-art 
optimization algorithm. 
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